Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you can use R Markdown syntax.
I hope to learn some useful techniques with R, and to be able to analyze data on my own after the semester ends.
Describe the work you have done this week and summarize your learning.
This is a dataset with 60 variables. By analyzing the data, we hope to understand which variables are related to exam points.
step 1: Data cleaning. To analyze the data, the first step is to clean it (scale the “Attitude” column) and select the information we are interested in. Since there are many variables (183 observations and 60 variables), which would make the analysis hard, I combine related questions into three broad categories: deep, surface, and strategic, and then average the values of the deep, surface, and strategic columns. Finally, I keep only the rows where points is greater than zero. These are the steps of the “data cleaning” stage.
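The cleaning steps could be sketched roughly as below. The raw data frame name (`lrn14`), the question-column vectors (`deep_questions`, `surface_questions`, `strategic_questions`), and the scaling divisor are my own placeholders, since the original code is not shown:

```r
# Hypothetical sketch of the cleaning step; `lrn14` and the three
# question-column vectors are assumed names, not taken from the source.
library(dplyr)

lrn14$attitude <- lrn14$Attitude / 10                 # scale the Attitude column (assumed divisor)
lrn14$deep <- rowMeans(lrn14[, deep_questions])       # average the deep questions
lrn14$surf <- rowMeans(lrn14[, surface_questions])    # average the surface questions
lrn14$stra <- rowMeans(lrn14[, strategic_questions])  # average the strategic questions

students2014 <- lrn14 %>%
  select(gender, age, attitude, deep, stra, surf, points) %>%
  filter(points > 0)                                  # keep rows where points > 0
```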
step 2: Show a graphical overview of the data and show summaries of the variables in the data.
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
After the data-cleaning step, the data has 166 observations and 7 variables, and I start drawing some plots.
The plots show the distribution of each variable and the relationship between each pair of variables, also split by gender. I found a positive correlation between attitude and points (0.43), while deep and surf are negatively correlated. From the box plots, I found that the values of age and the deep questions are more concentrated; however, age has many outliers.
The summary above shows the minimum, maximum, mean, and quartile values of each variable (age, attitude, deep, stra, surf, points).
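The graphical overview and summaries could be produced along these lines (a sketch, assuming the cleaned data is stored in `students2014`):

```r
library(GGally)
library(ggplot2)

# pairwise plots of all variables, colored by gender
ggpairs(students2014, mapping = aes(col = gender, alpha = 0.3),
        lower = list(combo = wrap("facethist", bins = 20)))

# numerical summaries (min, quartiles, mean, max) of each variable
summary(students2014)
```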
step 3: Choose attitude, deep question, and strategic question variables as explanatory variables and fit a regression model where exam points is the target (dependent) variable.
##
## Call:
## lm(formula = points ~ attitude + deep + stra, data = students2014)
##
## Coefficients:
## (Intercept) attitude deep stra
## 11.3915 3.5254 -0.7492 0.9621
##
## Call:
## lm(formula = points ~ attitude + deep + stra, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.5239 -3.4276 0.5474 3.8220 11.5112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.3915 3.4077 3.343 0.00103 **
## attitude 3.5254 0.5683 6.203 4.44e-09 ***
## deep -0.7492 0.7507 -0.998 0.31974
## stra 0.9621 0.5367 1.793 0.07489 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared: 0.2097, Adjusted R-squared: 0.195
## F-statistic: 14.33 on 3 and 162 DF, p-value: 2.521e-08
I choose points as the Y variable, and attitude, deep, and stra as the X variables, to fit a multiple regression. According to the summary, the model’s p-value is 2.521e-08, which is smaller than 0.05, so we can say the model is reasonable. However, the standard errors are fairly large, so the estimates are not very precise. Although the multiple R-squared and adjusted R-squared are both low (0.2097 and 0.195), we should not dismiss the model’s explanatory ability out of hand, since many other factors need to be taken into account, and R-squared is not the only criterion for judging a regression model.
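The model call is shown in the output above; it can be fitted like this (the object name `my_model` is my own choice):

```r
# fit the multiple regression: points explained by attitude, deep, and stra
my_model <- lm(points ~ attitude + deep + stra, data = students2014)
summary(my_model)
```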
step 4: Produce Residuals vs Fitted plot, Normal QQ-plot and Residuals vs Leverage plot.
1. Residuals vs Fitted plot: a “good” residuals vs. fitted plot should have no obvious outliers and be roughly symmetrically distributed around the 0 line, without particularly large residuals. From the plot, the residuals show no clear pattern against the fitted values, so the model seems suitable.
2. Normal QQ-plot: in theory, if both sets of quantiles come from the same distribution, the points should form a roughly straight line and fall approximately along the 45-degree reference line. From the plot, the points do fall approximately on the reference line, which means the residuals are approximately normally distributed.
3. Residuals vs Leverage plot: this plot helps identify data points with a large influence on the model. Points outside the red dashed Cook’s distance lines are influential; removing them would likely noticeably alter the regression results.
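The three diagnostic plots can be drawn directly from the fitted `lm` object; `which = c(1, 2, 5)` selects Residuals vs Fitted, Normal Q-Q, and Residuals vs Leverage (the model is re-fitted here so the sketch is self-contained):

```r
# diagnostic plots for the regression model
my_model <- lm(points ~ attitude + deep + stra, data = students2014)
par(mfrow = c(1, 3))
plot(my_model, which = c(1, 2, 5))
```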
## 'data.frame': 382 obs. of 35 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
## $ G1 : int 5 5 7 15 6 15 12 6 16 14 ...
## $ G2 : int 6 5 8 14 10 15 12 5 18 15 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
This is a dataset about students’ alcohol consumption. There are 382 observations and 35 variables, including the student’s sex, age, family size, alcohol consumption, parents’ education and jobs, and so on. Through the analysis, I want to study the relationships between high/low alcohol consumption and some of the other variables in the data.
I assume that “studytime” (weekly study time), “failures” (number of past class failures), “goout” (going out with friends), and “freetime” (free time after school) are important variables with a strong relationship to alcohol consumption. My hypothesis is that students who study less per week, fail more classes, go out with friends more often, and have more free time after school consume more alcohol.
1. Numerically and graphically explore the distributions
## # A tibble: 4 x 4
## # Groups: sex [2]
## sex high_use count mean_study_time
## <fct> <lgl> <int> <dbl>
## 1 F FALSE 157 2.34
## 2 F TRUE 41 2
## 3 M FALSE 113 1.88
## 4 M TRUE 71 1.62
## # A tibble: 4 x 4
## # Groups: sex [2]
## sex high_use count mean_failures
## <fct> <lgl> <int> <dbl>
## 1 F FALSE 157 0.204
## 2 F TRUE 41 0.439
## 3 M FALSE 113 0.239
## 4 M TRUE 71 0.479
According to the summary statistics of study time grouped by sex and high_use, females with shorter study time tend to consume more alcohol (more than twice a week), and those who study longer tend not to consume as much (at most twice a week). The pattern is the same for males, which corresponds to my assumption.
The summary statistics of failures grouped by sex and high_use show that females who failed more classes in the past tend to consume more alcohol, and those who failed fewer classes tend not to consume as much. The same holds for males, which also corresponds to my assumption.
The boxplots show that for the variable “goout”, females who consume more alcohol go out more, and the same holds for males, which corresponds to my assumption. For the variable “freetime”, females who consume more alcohol have more free time after school; however, the pattern is much weaker for males. This result is similar to my assumption but does not match it exactly.
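The grouped summaries and boxplots above could be produced roughly as follows (a sketch; `alc` is the joined dataset, as named in the model call later on):

```r
library(dplyr)
library(ggplot2)

# counts and mean study time by sex and high/low consumption
alc %>%
  group_by(sex, high_use) %>%
  summarise(count = n(), mean_study_time = mean(studytime))

# counts and mean number of past failures by sex and high/low consumption
alc %>%
  group_by(sex, high_use) %>%
  summarise(count = n(), mean_failures = mean(failures))

# boxplots of goout and freetime against high_use, split by sex
ggplot(alc, aes(x = high_use, y = goout, col = sex)) + geom_boxplot()
ggplot(alc, aes(x = high_use, y = freetime, col = sex)) + geom_boxplot()
```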
2.Use logistic regression to statistically explore the relationship between your chosen variables and the binary high/low alcohol consumption variable as the target variable.
I choose “studytime”, “failures”, “goout”, and “freetime” as the four X variables and fit a logistic regression with “high_use” (high/low alcohol consumption) as Y. I also separate the data into male and female in order to dig deeper into it. Below is what I found from the model.
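The model call appears in the output below; the odds ratios and their confidence intervals can be obtained by exponentiating the coefficients and the profile-likelihood intervals:

```r
# logistic regression with high_use as the binary target
m <- glm(high_use ~ studytime + failures + goout + freetime,
         data = alc, family = "binomial")
summary(m)

# odds ratios and their profile-likelihood confidence intervals
OR <- exp(coef(m))
CI <- exp(confint(m))
cbind(OR, CI)
```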
##
## Call:
## glm(formula = high_use ~ studytime + failures + goout + freetime,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8214 -0.7528 -0.5442 0.8552 2.4579
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.36957 0.62399 -3.797 0.000146 ***
## studytime -0.57481 0.16784 -3.425 0.000615 ***
## failures 0.19303 0.16899 1.142 0.253334
## goout 0.70490 0.12039 5.855 4.77e-09 ***
## freetime 0.07209 0.13531 0.533 0.594163
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 395.17 on 377 degrees of freedom
## AIC: 405.17
##
## Number of Fisher Scoring iterations: 4
## (Intercept) studytime failures goout freetime
## -2.36956938 -0.57481413 0.19303395 0.70489610 0.07209276
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.09352099 0.0267779 0.3109179
## studytime 0.56280947 0.4007339 0.7752293
## failures 1.21292398 0.8699068 1.6929038
## goout 2.02363642 1.6081702 2.5811264
## freetime 1.07475503 0.8240334 1.4026604
The summary shows that the standard errors for “studytime”, “failures”, “goout”, and “freetime” are 0.17, 0.17, 0.12, and 0.14, which are comparatively small; a smaller standard error indicates that the coefficient estimate is more precise. The coefficients are -0.57, 0.19, 0.70, and 0.07. Judging by the p-values, “studytime” and “goout” are strongly related to Y (“high_use”), while “failures” and “freetime” are not statistically significant.
The odds ratio of “studytime” is 0.563 (less than 1), of “failures” 1.213, of “goout” 2.024, and of “freetime” 1.075 (all higher than 1), which means that “failures”, “goout”, and “freetime” are positively associated with “high_use”, while “studytime” is negatively associated with it. According to the confidence intervals, the intervals for “studytime” and “goout” do not contain 1, whereas those for “failures” and “freetime” do, so only the former two associations are statistically significant.
According to the above results, “goout” is the variable most clearly positively associated with high/low alcohol consumption, and “studytime” is negatively associated with it; since the odds ratio of “studytime” is less than one, more study time goes with lower, not higher, consumption.
3. Use the variables that have a statistical relationship with high/low alcohol consumption to explore the predictive power of your model.
## prediction
## high_use FALSE TRUE
## FALSE 248 22
## TRUE 76 36
I remove “studytime” from the model and use the remaining variables to predict high_use. According to the confusion matrix, the precision is 248/(248+76) = 0.77 and the recall is 248/(248+22) = 0.92 (treating low consumption, FALSE, as the positive class). The model therefore has high recall but lower precision: most of the positive examples are correctly recognized, but there are quite a few false positives. The proportion of wrong predictions in the training data is 0.2565, and in the cross-validation it is 0.2487, so the model’s error rate is fairly high (around 25%).
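The prediction and cross-validation steps could look roughly like this (a sketch; `m2` is my own name for the reduced model, and a 0/1 loss function is assumed for `boot::cv.glm`):

```r
library(boot)

# refit without studytime, then predict high_use from the fitted probabilities
m2 <- glm(high_use ~ failures + goout + freetime, data = alc, family = "binomial")
alc$probability <- predict(m2, type = "response")
alc$prediction <- alc$probability > 0.5

# confusion matrix of actual vs predicted high_use
table(high_use = alc$high_use, prediction = alc$prediction)

# training error: proportion of wrong predictions
mean(alc$high_use != alc$prediction)

# 10-fold cross-validation with a 0/1 loss
loss_func <- function(class, prob) mean(abs(class - (prob > 0.5)))
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K = 10)
cv$delta[1]  # cross-validation error
```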
1. Show a graphical overview of the data and show summaries of the variables in the data.
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
The human data includes 155 observations and 8 variables: “Edu2.FM”, “Labo.FM”, “Edu.Exp”, “Life.Exp”, “GNI”, “Mat.Mor”, “Ado.Birth”, and “Parli.F”.
ggpairs shows the correlations between pairs of variables. I find that “Ado.Birth” and “Edu.Exp”, “Ado.Birth” and “Life.Exp”, “Mat.Mor” and “Edu.Exp”, and “Mat.Mor” and “Life.Exp” have strong negative correlations, while “Life.Exp” and “Edu.Exp”, and “Ado.Birth” and “Mat.Mor”, have strong positive correlations.
corrplot gives a clearer visualization than ggpairs: it shows the correlation between each pair of variables with color. The redder the cell, the more negatively correlated the two variables are; the bluer, the more positively correlated. However, corrplot only shows the general relationship between two variables; it does not show the exact correlation value.
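The overview could be produced along these lines (a sketch; `human` is an assumed name for the data frame):

```r
library(GGally)
library(corrplot)

# pairwise scatter plots with correlations
ggpairs(human)

# correlation matrix as a color plot: blue = positive, red = negative
cor_matrix <- cor(human)
corrplot(cor_matrix, type = "upper")

# numerical summaries of each variable
summary(human)
```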
2. Perform principal component analysis (PCA) on the non-standardized human data.
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000
## PC8
## Standard deviation 0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
Since the variables are not standardized, the standard deviations are large, which means the data values are widely dispersed; PC1 alone accounts for practically all of the variance. From the biplot, we can see that most of the countries cluster together, and the variable arrows have the same problem.
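The PCA on the raw data could be sketched as follows (assuming the data frame is named `human`):

```r
# PCA on the non-standardized data
pca_human <- prcomp(human)
summary(pca_human)

# biplot of the first two principal components
biplot(pca_human, choices = 1:2, cex = c(0.6, 0.8))
```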
3. Standardize the variables in the human data and repeat the above analysis. Are the results different? Why or why not?
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
The results before and after standardization are different. Before standardization, the countries and variable arrows all bunch together, which makes the biplot hard to interpret; after standardization, the countries are more evenly distributed, and the variables have more similar standard deviations (since the arrow lengths are almost the same).
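The standardized version only needs a `scale()` call before the same PCA (again assuming the data is in `human`):

```r
# standardize the variables, then repeat the PCA
human_std <- scale(human)
pca_human_std <- prcomp(human_std)
summary(pca_human_std)
biplot(pca_human_std, choices = 1:2, cex = c(0.6, 0.8))
```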
4. Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data.
After the human data is standardized, the countries are distributed more evenly. The arrows show the connections between the original features and the principal components (PC1, PC2), and the countries are placed on x and y coordinates defined by the two PCs. The angle between arrows represents the correlation between features: a small angle means a high positive correlation. We can see that, apart from “Parli.F” and “Labo.FM”, the variables all have a high positive correlation with the PCs.
The lengths of the arrows are proportional to the standard deviations of the features; from the plot, we can see that the variables have similar standard deviations.
5. Look at the structure and the dimensions of the tea data and visualize it. Interpret the results of the MCA and draw at least the variable biplot of the analysis.
The tea dataset includes 300 observations and 6 variables, which are:
“Tea”: Factor with 3 levels “black”, “Earl Grey”, “green”
“How”: Factor with 4 levels “alone”, “lemon”, “milk”, “other”
“how”: Factor with 3 levels “tea bag”, “tea bag+unpackaged”, “unpackaged”
“sugar”: Factor with 2 levels “No.sugar”, “sugar”
“where”: Factor with 3 levels “chain store”, “chain store+tea shop”, “tea shop”
“lunch”: Factor with 2 levels “lunch”, “Not.lunch”
The summary shows the details of the data; the counts of each variable’s levels are:
Tea: black 74, Earl Grey 193, green 33
How: alone 195, lemon 33, milk 63, other 9
how: tea bag 170, tea bag+unpackaged 94, unpackaged 36
sugar: No.sugar 155, sugar 145
where: chain store 192, chain store+tea shop 78, tea shop 30
lunch: lunch 44, Not.lunch 256
In the Tea variable, most observations are “Earl Grey” (193); in How, most are “alone” (195); in how, most are “tea bag” (170); in sugar, most are “No.sugar” (155); in where, most are “chain store” (192); and in lunch, most are “Not.lunch” (256).
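Inspecting and visualizing the tea data could look roughly like this (a sketch, assuming the six selected columns are stored in `tea_time`):

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# structure and dimensions: 300 obs. of 6 factor variables
str(tea_time)
dim(tea_time)

# bar plot of the level counts of each variable
gather(tea_time) %>%
  ggplot(aes(value)) +
  geom_bar() +
  facet_wrap("key", scales = "free") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```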
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.279 0.261 0.219 0.189 0.177 0.156
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.144 0.141 0.117 0.087 0.062
## % of var. 7.841 7.705 6.392 4.724 3.385
## Cumulative % of var. 77.794 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.003 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 0.027 2.867 | 0.433 9.160 0.338 10.053 |
## green 0.107 -5.669 | -0.108 0.098 0.001 -0.659 |
## alone 0.127 -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 0.035 3.226 | 1.329 14.771 0.218 8.081 |
## milk 0.020 2.422 | 0.013 0.003 0.000 0.116 |
## other 0.102 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag 0.161 -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 0.478 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged 0.141 -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
The visualization presents the summary graphically, which makes the data easier to interpret. Next, I perform multiple correspondence analysis. The summary shows the eigenvalues, individuals, categories, and categorical variables. According to the eigenvalues, Dim.1 and Dim.2 retain a larger percentage of the variance than the other dimensions. From the v.test values of the categories, the coordinates of “black”, “Earl Grey”, “green”, “lemon”, “milk”, “tea bag”, “tea bag+unpackaged”, and “unpackaged” are significantly different from zero (since their values fall outside ±1.96). According to the categorical variables (eta2), “how” and “where” have the strongest correlations with Dim.1.
The MCA biplot shows the possible variable patterns. The distance between variable categories reflects their similarity; for example, “lemon” and “alone” are more similar than “lemon” and “other”.
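The MCA call is shown in the output above; the analysis and its variable biplot could be reproduced along these lines (assuming the six factors are in `tea_time`):

```r
library(FactoMineR)

# multiple correspondence analysis on the tea data
mca <- MCA(tea_time, graph = FALSE)
summary(mca)

# variable biplot: category points on the first two dimensions
plot(mca, invisible = c("ind"), habillage = "quali")
```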